White Wine Exploration by Kevin Foong

This dataset contains 4,898 white wines with 11 variables on various chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (excellent).

Univariate Plots Section

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

This dataset consists of 13 variables, including one index which we will ignore, leaving us with 11 chemical properties and 1 quality rating.

To get an overview of the data we first take a look at the histograms of all 12 variables, experimenting with the binwidth to capture the right granularity. A lot of the variables are long-tailed with high values. To remove the outliers I will replot the histograms, omitting the top 1% of data for those variables which are long-tailed.
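The trimming step described above could be sketched roughly as follows. This is only an illustration: the helper name trim_top_1pct and the synthetic data are assumptions, not taken from the original analysis, and the cutoff is assumed to be the 99th percentile.

```r
# Hypothetical helper (not from the original analysis): keep only values
# at or below the 99th percentile of a numeric vector.
trim_top_1pct <- function(x) {
  cutoff <- quantile(x, probs = 0.99, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Synthetic long-tailed data standing in for a column such as residual.sugar.
set.seed(42)
sugar <- c(rexp(990, rate = 1 / 6), runif(10, min = 40, max = 66))

trimmed <- trim_top_1pct(sugar)

# Compare the range before and after trimming.
range(sugar)
range(trimmed)
length(sugar) - length(trimmed)  # number of values dropped
```

Trimming by quantile rather than by a fixed threshold means the same helper can be reused for every long-tailed variable without choosing a cutoff by hand.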

This is now a bit clearer. I notice citric acid is approximately normally distributed but has a spike at around 0.48. This is likely due to a standard in the wine industry. We see that many of the distributions are normally distributed while some are not. I read the notes to gain a better understanding of each attribute. We will now target specific attributes to explore further.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

I note that the lowest quality rating is a 3 (20 wines) and the highest quality rating is a 9 (5 wines). This variable can probably be turned into an ordinal factor variable for analysis. We will do this later.
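A minimal sketch of the conversion to an ordinal factor, using the counts from the table above to rebuild the quality column. The variable names quality and quality_f are illustrative only.

```r
# Rebuild the quality column from the counts shown in the table above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Convert to an ordered factor so that the ranking 3 < 4 < ... < 9 is kept.
quality_f <- factor(quality, levels = 3:9, ordered = TRUE)

is.ordered(quality_f)
table(quality_f)

# Ordered factors support comparisons between levels.
quality_f[1] < quality_f[length(quality_f)]
```

An ordered factor lets plotting and grouping functions treat each rating as a category while still respecting the ranking of the ratings.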

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

To make the data more meaningful I decide to create a new variable called alcohol.bucket which will group wines according to alcohol content. I decide to create 1% buckets, so 8-9%, 9-10% etc. Our resulting plot shows that most wines contain 9-10% alcohol.
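One way such a bucket variable might be built is with cut(). The sketch below uses the first ten alcohol values shown in the str() output. The exact break points and whether each interval is closed on the left or the right are assumptions, since the original code is not shown.

```r
# First ten alcohol values from the str() output above.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# 1% wide buckets from 8% to 15%. right = FALSE makes each interval
# closed on the left, so a wine at exactly 9.0% falls into [9,10).
alcohol.bucket <- cut(alcohol, breaks = 8:15, right = FALSE)

table(alcohol.bucket)
```

The upper break of 15 is chosen so that the maximum alcohol value in the summary (14.2) still falls inside a bucket.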

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Residual sugar contains a number of outliers above 20. Outliers are values much greater than the norm. As we can see from the boxplot the median is only about 5 but there are a few values over 60. As outliers can skew our results we will remove them in our next plot.

## [1] 1

Residual sugar is the amount of sugar in the wine after fermentation has stopped. It appears that most white wines are not particularly sweet. As indicated in the notes, wine is considered sweet only if it contains over 45 grams/liter of residual sugar. In this case only 1 wine contains more than 45 grams/liter.

The original plot shows a sizeable number of wines with residual sugar greater than the mean. To gain a better understanding we replot this using a log10 transformation. As previously stated we also remove outliers by limiting residual sugar to a maximum of 30.

The resulting plot shows a bimodal distribution. It is interesting to find that there are as many, if not more, wines with residual sugar greater than 3 grams/liter as there are with less than 3 grams/liter.
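The effect of the log10 transformation can be illustrated with synthetic data. The two-group mixture below is an assumption made purely for illustration and is not the wine data. The bins are computed without drawing a plot so that the counts can be inspected directly.

```r
set.seed(1)

# Synthetic stand-in for residual sugar: one group near 1.5 g/L and a
# second group near 10 g/L (assumed values, for illustration only).
sugar <- c(rlnorm(500, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(500, meanlog = log(10),  sdlog = 0.4))

# Limit to a maximum of 30, as described in the text.
sugar <- sugar[sugar <= 30]

# Bin on the log10 scale. Equal-width bins in log space stretch out the
# crowded low end and compress the long tail.
h <- hist(log10(sugar), breaks = 30, plot = FALSE)

# Bin midpoints converted back to grams/liter, with their counts.
data.frame(mid_g_per_l = round(10^h$mids, 2), count = h$counts)

# Share of values on either side of 3 g/L.
mean(sugar > 3)
mean(sugar <= 3)
```

On the untransformed scale the second group is smeared across a wide range and reads as a tail. On the log scale both groups occupy a similar width, which is why the second mode becomes visible.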

Univariate Analysis

What is the structure of your dataset?

The dataset contains 4898 rows and 13 variables. One variable is just a running number which we will ignore. Another variable is the quality rating given by wine tasters and that is our output variable. The other 11 are chemical properties of the wine.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is the quality rating given to each wine. This is the outcome variable and we are trying to determine which attributes are best for predicting wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

At this stage most of the 11 attributes can potentially influence the quality rating of the wine. Googling wine tasting, I find that an often repeated important criterion is balance, that is, the balance of various components in the wine such as fruity, sweet and sour, bitter and earthy characteristics. Each of these attributes can potentially affect this balance.

Having said that, my exploration so far gives me a better understanding of each attribute. I think alcohol and residual sugar may help support my investigation.

Did you create any new variables from existing variables in the dataset?

Yes, I created an alcohol.bucket variable where each bucket is 1% of alcohol. I find grouping wines into these buckets makes it easier to spot trends in the plots.

Of the features you investigated, were there any unusual distributions? Did
you perform any operations on the data to tidy, adjust, or change the form of
the data? If so, why did you do this?

I noticed residual sugar was long-tailed with a substantial amount of data in the long-tail section even after removing the top 1%. To make it easier to observe and understand the data I decided to apply a log10 transformation. By doing this we were able to observe a bimodal distribution which we otherwise wouldn’t have been able to observe.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000

I start by exploring the correlations between variables by outputting a correlation matrix of all 12 variables. No attributes are highly correlated with quality. The two attributes most correlated with quality, both with a medium correlation, are alcohol (0.44) and density (-0.31).

Of the other attributes the following have only slight (low) correlation with quality: fixed acidity (-0.11), volatile acidity (-0.19), chlorides (-0.21) and total sulfur dioxide (-0.17).
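The ranking of attributes by their correlation with quality can be reproduced from the quality column of the matrix above. In this sketch the values are copied by hand from that output. With the full data frame they would presumably come from something like cor(w)[, "quality"], where w is the data frame name that appears in the model output, but that call is an assumption.

```r
# Correlations with quality, copied from the matrix above.
cor_quality <- c(
  fixed.acidity        = -0.113662831,
  volatile.acidity     = -0.194722969,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.097576829,
  chlorides            = -0.209934411,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.174737218,
  density              = -0.307123313,
  pH                   =  0.099427246,
  sulphates            =  0.053677877,
  alcohol              =  0.435574715
)

# Rank by absolute strength, keeping the sign for direction.
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
round(ranked, 2)
```

Sorting by absolute value matters here because most of the stronger relationships with quality are negative.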

I perform a correlation plot of all our variables to get a quick visual overview of the relationships.

The following attributes have the following correlation with alcohol:

The following attributes have the following correlation with density:

This information is useful because as we build our model, we have a better idea of which variables to include or not include in our model. Variables that are correlated with each other may not add much to the model. On the other hand variables that have some correlation with quality but not with each other may be good candidates.

We will now look closer at correlations between specific variables.

## [1] 0.4355747

I decide to change the quality attribute into an ordinal factor (category) so that boxplots / scatterplots of each quality category can be plotted next to each other and we can easily compare the amount of alcohol for each one. The scatterplot will give you an idea of the number of wines under each quality category which the boxplot does not provide. The mean is also calculated as denoted by a red star.
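The per-category mean that is overlaid on each boxplot can be computed with tapply(). The tiny data frame below is made up for illustration and is not the wine data.

```r
# Made-up example rows (not the wine data).
wines <- data.frame(
  quality = factor(c(5, 5, 6, 6, 6, 7, 7), levels = 5:7, ordered = TRUE),
  alcohol = c(9.2, 9.6, 10.0, 10.4, 11.0, 11.4, 12.0)
)

# Mean alcohol within each quality category.
means <- tapply(wines$alcohol, wines$quality, mean)
means
```

In base graphics, one way to draw the overlay would be boxplot(alcohol ~ quality, data = wines) followed by points(seq_along(means), means, pch = 8, col = "red"). The original plots may well have been produced differently.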

In this plot I find a distinct trend from quality level 5 onwards that as alcohol increases then so does the quality rating.

## [1] -0.3071233

I do the same for density vs quality. As in the univariate analysis section I remove the top 1% to remove outliers. Here we see a trend that when density decreases, quality increases.

## [1] -0.1136628

## [1] -0.194723

## [1] -0.2099344

## [1] -0.1747372

The above 4 plots of fixed acidity, volatile acidity, chlorides and total sulfur dioxide all have low correlation to quality but I have included them to see if I could find any interesting trends. Here are my observations:

Fixed acidity - The plot is almost flat with only a slight decreasing trend that when fixed acidity decreases, quality increases.

Volatile acidity - Overall the trend is somewhat like a sine curve. There is an increase in quality when there is a drop in volatile acidity between the two most populous buckets 5 and 6, but then the trend goes in the opposite direction.

Chlorides - It is interesting to see that the amount of chlorides does seem to suggest a difference in rating. In the quality buckets 5, 6 and 7, where there are the most wines, we can see a noticeable trend that as chlorides decrease, quality increases.

Total Sulfur dioxide - There is a distinct trend between the 3 most populous quality buckets 5, 6 and 7 that when total sulfur dioxide decreases, quality rating increases.

## [1] -0.7801376

## [1] 0.8389665

## [1] 0.5298813

Alcohol and residual sugar have a high correlation to density so I take a closer look at their relationships. Looking at the plots I can see a distinct trend that density decreases when alcohol increases and residual sugar decreases, and vice versa. This is probably due to the heaviness / lightness of each component which will affect the density.

I also take a look at total sulfur dioxide against density as they have a moderate correlation.

Residual sugar, total sulfur dioxide and chlorides are all moderately correlated with alcohol so I decide to take a closer look at these variables.

With residual sugar, looking at the right side of the plot, we can see that the more residual sugar the lower the alcohol. However the opposite does not appear to be true. Less residual sugar does not necessarily mean greater amounts of alcohol, as we can see a good spread of data in the lower residual sugar areas.

Total sulfur dioxide also has an interesting pattern. Where total sulfur dioxide is around the middle, alcohol is quite evenly spread out, as can be seen by the number of dots between 100 and 150 of total sulfur dioxide. However, above 150, alcohol starts to drop.

With chlorides between around 0.01 and 0.025, alcohol content tends to be higher, and between around 0.05 and 0.065 alcohol content tends to be lower. In the middle at around 0.03 alcohol tends to be evenly spread out. Also it is interesting that from 0.065 and greater alcohol remains low at between 8-10%.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The attributes with the highest correlation to quality are alcohol followed by density, with a low to medium correlation. No attributes were highly correlated to quality.

There were several other attributes that had quite low correlations with quality. These were fixed acidity, volatile acidity, chlorides and total sulfur dioxide.

I know from reading up on wine tasting that there are many components at play when rating wine - balance being one of them, that is the various components within the wine that make up the taste. With this in mind I didn’t want to discount any of the attributes unnecessarily.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I observed that residual sugar and alcohol are highly correlated with density probably due to the characteristics of each component. When there is more sugar in the wine (heavier) density increased and when there is more alcohol in the wine (lighter) density decreased.

I observed some not straightforward trends with alcohol vs residual sugar. When residual sugar was high it meant lower amounts of alcohol. However when residual sugar was low it did not necessarily mean alcohol was high. Likewise with alcohol vs total sulfur dioxide, when total sulfur dioxide was higher, alcohol tended to be lower. But around the mean and below, alcohol was more evenly spread out. These non-linear trends could be due to chemical thresholds.

Alcohol vs chlorides was also interesting in that the majority of wines had around 0.05 of chlorides. Alcohol tended to be evenly spread, but when chlorides rose above the mean alcohol started to drop. Again, some chemical thresholds could be at play.

I also found that some plots, such as volatile acidity vs quality, showed a trend of decreasing volatile acidity across the quality buckets 5, 6 and 7. These buckets also happened to be the most populous buckets. So even though the extreme buckets on either end buck the trend, it may not matter as much, as these buckets have a much smaller number of wines.

What was the strongest relationship you found?

The strongest relationship to our feature variable quality is alcohol followed by density. However alcohol and density have a high correlation between themselves. Residual sugar is also highly correlated with density.

Multivariate Plots Section

I know from previous analysis that the two attributes most correlated with quality are density and alcohol, so that is my starting point.

I first look at the relationship between density, alcohol and residual sugar, and this plot confirms what we already know, that all three variables are highly correlated.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

I’m interested to tie this back to our main feature variable quality. I decide to regroup the quality buckets by combining two lowest ratings (3 and 4) into one bucket and the two highest ratings (8 and 9) into another bucket. The reason I do this is because previously I found that, for some variables, trends existed only in the 5, 6, 7 quality rating buckets which was also the most populous. The buckets at the end in comparison had a lot fewer wines. I wondered if I combined these end buckets, whether it would more clearly show any trends. (For example the plots chlorides vs quality and total sulfur dioxide vs quality exhibited this trend.)
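The regrouping described above could be done with cut(). In this sketch the quality column is rebuilt from the counts in the table, and the bucket labels are illustrative assumptions. The break points follow the description in the text (3 and 4 combined, 8 and 9 combined, and the middle ratings left as they are), which may differ from the breaks used in the original code.

```r
# Rebuild the quality column from the counts shown in the table above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Combine the sparse end ratings. Intervals are (2,4], (4,5], (5,6],
# (6,7] and (7,9], so 3 and 4 share a bucket, as do 8 and 9.
quality.bucket <- cut(quality,
                      breaks = c(2, 4, 5, 6, 7, 9),
                      labels = c("3-4", "5", "6", "7", "8-9"),
                      ordered_result = TRUE)

table(quality.bucket)
```

After merging, the two end buckets hold 183 and 180 wines, which is still small next to the middle buckets but far less sparse than 20 and 5.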

I replot the previous graph using facet_wrap on the new quality buckets. In this case we can clearly see the interplay between the 4 variables. For example, comparing the 7,9 bucket with the 3,5 bucket, we can see that generally lower density and higher alcohol meant a higher rating. Residual sugar also seems to be lower in the 7,9 bucket. Also, as we found previously, less residual sugar did not necessarily mean a higher rating, and that fact is also reflected in this plot.

Total sulfur dioxide is moderately correlated with both alcohol and density and lowly correlated to quality. I want to take a look at these three variables together. I cut total sulfur dioxide into buckets based on its interquartile range. I also decide to plot total sulfur dioxide vs alcohol and total sulfur dioxide vs density on top of each other so I can compare them.
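A sketch of cutting total sulfur dioxide into quartile-based buckets. It assumes that "based on its interquartile range" means using the minimum, the three quartiles and the maximum as break points. Those five values are taken from the summary output earlier, and the sample values are the first ten from the str() output.

```r
# Min, 1st quartile, median, 3rd quartile and max of total.sulfur.dioxide,
# taken from the summary() output. With the full data these would come
# from quantile() applied to the column.
tsd_breaks <- c(9, 108, 134, 167, 440)

# First ten values from the str() output.
tsd <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# include.lowest = TRUE keeps the minimum value inside the first bucket.
tsd.bucket <- cut(tsd, breaks = tsd_breaks, include.lowest = TRUE)

table(tsd.bucket)
```

Quartile-based breaks give buckets of roughly equal size on the full data, so each one has enough wines to show a trend.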

The two plots both exhibit the expected trends. Visually the alcohol plot looks a bit cleaner, probably because alcohol is more correlated to quality than density.

I do the same with chlorides as it is somewhat correlated with density, alcohol and quality. I cut chlorides into buckets based on its interquartile range. We can see quite clear trends, especially in the plot with alcohol. This plot doesn’t really tell me anything new that I don’t already know.

Previously I found there was a clear trend in volatile acidity vs quality in the middle ratings. I now want to see if volatile acidity strengthens the alcohol vs quality relationship. The plot appears to show that it does. Volatile acidity is more evident in the lower alcohol and lower quality ratings.

In an earlier plot I found that residual sugar, when transformed using a log10 scale, displayed a bimodal distribution, which suggests it would lend itself well to 2 categories of less sugar and more sugar. I decide to explore this further by cutting residual sugar into 2 buckets along these lines and comparing it with alcohol and quality.

In this plot I can see trends between the three variables as they are all somewhat correlated. However it is hard to tell how much residual sugar actually strengthens the relationship if any.

It is surprising to find that quality buckets 3 and 4 and buckets 8 and 9 have residual sugar from both buckets in about even amounts. You would have expected to see less residual sugar equate to higher ratings. And this is also despite alcohol being higher in the buckets 8 and 9 compared to 3 and 4.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = w)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = w)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar, 
##     data = w)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     density, data = w)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     density + chlorides, data = w)
## 
## ==========================================================================================
##                          m1            m2            m3            m4            m5       
## ------------------------------------------------------------------------------------------
##   (Intercept)           2.582***      3.017***      2.356***     74.225***     73.271***  
##                        (0.098)       (0.098)       (0.114)      (11.977)      (11.999)    
##   alcohol               0.313***      0.324***      0.375***      0.286***      0.283***  
##                        (0.009)       (0.009)       (0.010)       (0.018)       (0.018)    
##   volatile.acidity                   -1.979***     -2.107***     -2.059***     -2.044***  
##                                      (0.110)       (0.109)       (0.109)       (0.110)    
##   residual.sugar                                    0.027***      0.052***      0.052***  
##                                                    (0.002)       (0.005)       (0.005)    
##   density                                                       -71.546***    -70.514***  
##                                                                 (11.923)      (11.949)    
##   chlorides                                                                    -0.692     
##                                                                                (0.540)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.190         0.240         0.259         0.264         0.264     
##   adj. R-squared        0.190         0.240         0.258         0.263         0.263     
##   sigma                 0.797         0.772         0.763         0.760         0.760     
##   F                  1146.395       773.875       568.789       438.646       351.293     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -5839.391     -5681.776     -5622.083     -5604.126     -5603.301     
##   Deviance           3112.257      2918.264      2847.993      2827.187      2826.235     
##   AIC               11684.782     11371.552     11254.166     11220.251     11220.603     
##   BIC               11704.272     11397.538     11286.649     11259.231     11266.079     
##   N                  4898          4898          4898          4898          4898         
## ==========================================================================================

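The largest linear model (m5) accounts for 26.4% of the variance in the quality rating of white wines. I note that volatile acidity, which only had some correlation with quality but was lowly correlated with alcohol, did increase the R-squared value in the model, from 0.190 in m1 to 0.240 in m2. Adding chlorides in m5, by contrast, leaves the R-squared unchanged from m4 and its coefficient is not significant.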
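The way a predictor that is unrelated to alcohol can still raise R-squared can be illustrated with simulated data. Everything in this sketch is an assumption chosen for illustration: the coefficients, noise level and sample size are not estimates from the wine data.

```r
set.seed(123)
n <- 500

# Simulated predictors, drawn independently of each other.
alcohol          <- rnorm(n, mean = 10.5, sd = 1.2)
volatile.acidity <- rnorm(n, mean = 0.28, sd = 0.10)

# Simulated outcome that depends on both predictors plus noise.
quality <- 2.5 + 0.3 * alcohol - 2 * volatile.acidity + rnorm(n, sd = 0.75)
sim <- data.frame(quality, alcohol, volatile.acidity)

# Nested models, built up one term at a time.
m1 <- lm(quality ~ alcohol, data = sim)
m2 <- update(m1, . ~ . + volatile.acidity)

# R-squared for each model.
c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)

# F-test for whether the added term improves the fit.
anova(m1, m2)
```

Because the second predictor carries information about the outcome that the first does not, it explains additional variance. A predictor that mostly duplicates one already in the model adds much less.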

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Volatile acidity does seem to strengthen the alcohol vs quality relationship. We can quite clearly see a trend that as you move through the quality rating buckets from low to high there is more and more alcohol and less and less volatile acidity.

Variables which were correlated with each other, when plotted, displayed the expected trends, but it was sometimes hard to determine if they actually strengthened each other. Perhaps they did a little but it is hard to work that out visually.

Were there any interesting or surprising interactions between features?

The low quality rating bucket (3 and 4) and high quality rating bucket (8 and 9) had almost the same spread of residual sugar. This bucked the trend of the middle buckets, where less residual sugar meant more alcohol and higher quality ratings. This might suggest that there are other elements at play here and that it is not just a straight residual sugar / alcohol / quality relationship.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I created a linear model from my dataset. I used alcohol, the most correlated variable with quality, as my starting point. I wasn’t sure if density, the second most correlated variable with quality, would make a lot of difference to the R-squared. In the end it made a slight difference. Volatile acidity, a lowly correlated variable, made a difference as it was not correlated with alcohol.


Final Plots and Summary

Plot One

Description One

One of my first plots is to explore alcohol against quality. This plot is interesting because there are clear trends between quality buckets 5, 6 and 7. As alcohol increases the quality rating also increases. This can be seen in the following buckets.

  • Bucket 5 - most wines have low levels of alcohol and there are not that many with high levels.
  • Bucket 6 - is more evenly spread out
  • Bucket 7 - is the opposite with more wines with higher levels of alcohol.

What I am also interested to explore in the future are the wines that contradict this trend. There are some wines in quality bucket 8 that have about 9% alcohol. How did these wines with low alcohol receive such high ratings? Likewise there is a smattering of medium alcohol wines that received low quality ratings of 3 and 4. Why did these wines rate so badly?

Plot Two

Description Two

This plot was redesigned from earlier using boxplots instead of a scatterplot.

This plot presents some interesting and not so straightforward trends. For example it shows that in alcohol buckets 8, 9 and 10 the lower the volatile acidity the higher the quality rating. However in the higher alcohol buckets 11, 12 and 13 volatile acidity seems to be quite even amongst the different quality ratings.

Alcohol on the other hand, as it increases, then generally so does quality as can be seen by the darker bars in the higher alcohol buckets. The relationship between volatile acidity and alcohol is that as alcohol increases then generally so does volatile acidity.

Plot Three

Description Three

This plot provides a good overview of the interplay between 3 attributes of wine (alcohol, density and residual sugar) and quality. There is visible correlation between alcohol, density and quality rating - the higher the rating the higher the alcohol and the lower the density. When adding residual sugar to the mix we can see that when wines are more highly rated they tend to have less sugar. However less sugar didn’t necessarily mean higher ratings, as all 4 quality buckets had a good proportion of wines with little residual sugar.


Reflection

Which direction to take in my exploration was the hardest part of this project. I wasn’t very familiar with the chemical properties of wine to begin with so I had to perform some research on wine tasting using Google. In hindsight, having a good understanding of the domain you are investigating would greatly help guide which direction to take when exploring the data, as it is so open ended.

I found some interesting trends when producing some of the plots. Sometimes this led to further questions which I tried to follow through with further plots. This was quite rewarding, especially when the line of enquiry finally led to insights that weren’t at first apparent. In some lines of enquiry there were results that followed the expected trends but there were also some parts of the data that bucked the trend. For example, less residual sugar generally meant more alcohol and a higher quality rating, but in the lowest and highest quality rating buckets the amount of sugar was quite evenly spread out. This is probably due to certain thresholds: when sugar is greater or less than certain values it no longer affects the quality rating.

Generally I found that the exploratory data analysis process is not a linear one. Many times I would revisit univariate or bivariate analysis when performing the multivariate section. One discovery led to further questions which subsequently meant revisiting existing plots or plotting new ones.

For the future I think there is scope to delve deeper into some of the analyses. For example, where there are attributes with certain parts that bucked the trend, it would be interesting to isolate these areas and see how other variables interact with them. Maybe trends can be established within specific subsets of data.